Assessing Read Quality
Cleaning reads
In the last episode, we took a high-level look at the quality of each of our samples using FastQC. We visualised per-base quality graphs showing the distribution of the quality at each base across all the reads from our sample. This information helps us to determine the quality threshold we will accept, and thus, we saw information about which samples fail which quality checks. Some of our samples failed quite a few quality metrics used by FastQC. However, this does not mean that our samples should be thrown out! It is common to have some quality metrics fail, which may or may not be a problem for your downstream application. For our workflow, we will remove some low-quality sequences to reduce our false-positive rate due to sequencing errors.
To accomplish this, we will use a program called Trimmomatic. This useful tool filters poor quality reads and trims poor quality bases from the specified samples.
Trimmomatic options
Trimmomatic has a variety of options to accomplish its task.
| Option | Meaning |
|---|---|
ILLUMINACLIP |
Cut adapter and other illumina-specific sequences from the read. |
SLIDINGWINDOW |
Perform a sliding window trimming, cutting once the average quality within the window falls below a threshold. |
MINLEN |
Drop the read if it is below a specified length. |
LEADING |
Cut bases off the start of a read, if below a threshold quality. |
TRAILING |
Cut bases off the end of a read, if below a threshold quality. |
CROP |
Cut the read to a specified length. |
HEADCROP |
Cut the specified number of bases from the start of the read. |
AVGQUAL |
Drop the read if the average quality is below a specified value. |
MAXINFO |
Trim reads adaptively, balancing read length and error rate to maximise the value of each read. |
First, we must specify whether we have reads that are paired-end, single-end or a paired-end collection. Next, we will specify whether to perform ILLUMINACLIP. For our reads we want to perform adapter removal, using TruSeq3.
We can use the default parameters. Next we will chose which “Trimmomatic Operation” we want to use. You can use multiple operation. First we will use the SLIDINGWINDOW operation, using 4 bases to average across and a average quality score of 20.
Finally, we want to chose an additional “Trimmomatic Operation”, MINLEN. This operation will drop the read if it is below a specified length. Use your FastQC results to determine what this length should be. Ideally, you would want to ensure you keep reads that are a useful length, but you don’t want to drop too many e.g. if the average length of your reads is around 150 bp, you wouldn’t want to drop reads below 149 bp. This might remove all your reads, instead using a conservative number of 100-120 bp would be better.
Although we will use only a few options and trimming steps in our analysis, understanding the steps you are using to clean your data is essential. For more information about the Trimmomatic arguments and options, see the Trimmomatic manual.
Running Trimmomatic
Now we have an understanding of the parameters we can use with Trimmomatic, we can run the tool. Make sure to select the Output trimmomatic log messages option, as this will provide useful information regarding what Trimmommatic actually did.
Settings SLIDINGWINDOW and MINLEN
Once Trimmomatic completes, you will have 5 outputs:
- Trimmomatic on X data (R1 paired)
- Trimmomatic on Y data (R2 paired)
- Trimmomatic on X data (R1 unpaired)
- Trimmomatic on Y data (R2 unpaired)
- Trimmomatic on X and Y data (log file)
The reads we are interested in for this analysis are the paired outputs and we are also interested in the log file.
Editing Dataset Attributes
You have probably noticed by now that the names of our files are beginning to be long and difficult to decipher. Therefore, we should edit the data attributes of our files, to give more descriptive names.
To do this, click the “pencil” icon and edit the name, then click “Save”.
I’d recommend using the sample name, as this is descriptive and therefore provides more context for the reads. Don’t forget to include information regarding the processing the reads have undergone e.g. add _trimmed to the name (XXXX_trimmed_1.fastq).
Checking the Impact of Trimmomatic
To assess the impact Trimmomatic had on our reads, we can rerun FastQC.
From the comparison we can see that the metabarcoding sample (B_sample_98) hasn’t changed much compared to before trimming, the reads are still failing the same FastQC tests as previously.
As we can explain the results we are observing from the FastQC, we can proceed.